%load_ext autoreload
%autoreload
# lines above reload mycode so that any changes are pulled in
from mycode import *
One of the many issues that underlie determining the effectiveness of mask usage is the wide variety of masks available for purchase, which can be grouped into one of three categories: cloth, surgical, or N95. A 2020 study (Konda, 2020) sought to capture the efficiencies of various common cloth materials at filtering respiratory droplets. It was found that the filtering efficiency of cotton, one of the most commonly used fabrics in cloth masks, is highly dependent on its thread count. For the SARS-CoV-2 coronavirus, which averages approximately 60nm to 140nm in diameter (Bar-On, 2020), single-layer cotton fabric has a filtration efficiency of approximately 5% to 65%, depending on the tightness of its weave. Moreover, this filtration efficiency can be improved through the use of multiple fabric layers and the use hybrid materials, such as cotton-silk or cotton-flannel.
Unlike cloth masks, which gained popularity during the COVID-19 pandemic, surgical and N95 masks have been commonplace in both healthcare and industrial settings for decades and, as such, have been tested more extensively. While many of these studies have been performed in vitro (Johnson et al., 2009; Fabian et al., 2011; Patel et al., 2016), one conducted in 2020 (Liung et al., 2020) used 246 participants infected with either influenza, rhinovirus, or seasonal coronavirus. For the ten individuals infected with seasonal coronavirus, surgical face masks offered a significant reduction in the number of coronavirus particles present in aerosols. The same, however, could not be said for those infected with either influenza or rhinovirus. This finding suggests that mask usage might be an appropriate control measure for COVID-19, though the behavioral differences between seasonal coronavirus and the SARS-CoV-2 coronavirus are still unclear.
One important factor that must be considered in addition to the filtering efficiencies of various mask types is the level of adherence to mask usage in public spaces. It has been found that transmission probability per contact is reduced by mask wearing, particularly when compliance is high (Howard et al., 2021; MacIntyre et al. 2009). This finding is further supported an evidence review published in 2020 (Liang et al.), which concluded that “masking with prudent implementation and high compliance is a prerequisite to ensuring successful protection.” Furthermore, it has been found that adherence levels to preventative protocols is dependent on the perceived risk of the virus (Cava et al., 2005), implying that a high level of perceived risk might be required to achieve a high level of compliance.
At the time of writing, no relevant studies could be found that used the number of confirmed cases of COVID-19 as a metric for gauging mask effectiveness. This is likely due to the challenge of measuring mask compliance in an area. Using self-reported mask usage data, this paper will seek to discover if mask compliance in the United States had a statistically significant effect on the spread of COVID-19 during the observation period of June 10, 2020 through July 21, 2020.
In order to control for relevant population metrics, the mask usage dataset was joined with the U.S. Census Bureau’s 2018, five-year aggregate, American Community Survey, also gathered from BigQuery’s public datasets. The statistics chosen include ethnicity, age, median income, and population determined poverty status. These factors were chosen in accordance with ethnicity-related differences and age-related differences in COVID-19 risk as determined by the Center for Disease Control (CDC). In an effort to protect the data from being skewed by large populations, the raw data for age, sex, and race statistics were converted to population proportions in each county.
In order to control for the effects of various COVID-19 policies implemented in each state, the Oxford Policy Tracker table, again provided through BigQuery’s public datasets, was joined with the other data. This table provides insight into the stringency of eight categories of COVID-19-related policies, measured on a scale of 0 to 3. In the initial exploration of the data, it was discovered that 16 states decreased the stringency of at least one of their policies during the first week of the period under observation. Additionally, 12 states made changes to the stringencies of at least one policy during the remainder of the observed period. In order to avoid underreporting the effect of these policies on COVID-19 transmission, the maximum stringency of each policy reported during the observation period was used during analysis.
Since mask usage appears to have a greater effect on more densely populated areas (Aabed, Lashin, 2021), the population density for each county was gathered from the U.S. Census Bureau and uploaded into BigQuery; this data comes from the Average Household Size and Population Density dataset. Of the 3,142 counties in the United States, only one lacked a recorded population density; as such, this county was removed from the dataset. Additionally, once in BigQuery, the state to which each county belongs was assigned to one of four regions as designated by the U.S. Census Bureau, using indicator variables. This classification was performed in order to control for the effects of various climate conditions on the spread of COVID-19 (Cao et al., 2021). Finally, a dataset was gathered from the U.S. Census Bureau which classifies counties as either metropolitan or non-metropolitan. This data was used to create an indicator variable for the dataset, classifying each county into either category, so that interaction variables could be constructed between a county's metropolitan status and its level of mask adherence.
Similar to the mask usage data, the data for the dependent variable in the regression were gathered from BigQuery’s public datasets. The COVID-19 U.S.A Facts Summary, provided by the Center for Disease Control (CDC), offers daily time series data on the cumulative number of COVID-19 cases reported in each county. These raw numbers were manipulated to provide the weekly average of new cases in each county,as will be discussed in the next section.
The Dependent Variable</br> During the time that Dynata conducted its survey on mask usage, the U.S. was witnessing a spike in the number of new COVID-19 cases. Though this spike was not as drastic as the one witnessed at the end of 2020, it still represents an increase in the weekly average of new cases of approximately 38,000 cases per week over a five-week period. In Figure 1 below, let MA(t1) and MA(t2) denote the moving averages of new cases per 10,000 individuals at the trough and peak, respectively, of the spike under observation. For each of the 3,141 counties in the dataset, the dependent variable for this analysis, Yi , can then be calculated as show below.
</br>
Figure 1. Daily trends in the number of COVID-19 cases in the united states reported to the CDC.
As discussed in the previous section, the data in this table reported cumulative daily cases in each U.S. county. In order to achieve the average number of weekly cases in each county, several steps were taken. First, due the massive amount of data conatined within it, all dates outside of the weeks corresponding to the trough and peak of the period under observation were filtered out of the table using a simple where clause. In order to calculate new case numbers from the existing cumulative data, the lag function was used to copy the case information contained in the previous record; this lag was performed in a partition by county, ordered by date, to ensure that the previous record contained the cumulative case data from the previous day within the same county. Daily new cases were then calculated by taking the difference between these values. The remainder of the query is comprised of the following steps:
Since this query plays an integral role in the construction of each of the datasets required for this project, it has been included for the reader to view. The function below generates the query and displays each of its 9 stages.
#Uncomment line below to view query
#show_new_cases()
An initial exploratory analysis of this variable revealed several key insights. A table of descriptive statistics, generated by a built-in function of the pandas Python extension, as well as an analysis of the percentiles of the variable, revealed that a small percentage of the data contains relatively extreme values. The histogram below shows these values to be densely clustered around zero and heavily positively skewed. Additionally, a chloropleth map of Y, shown below the histogram, has been included. Construction of this map intially involved joining the project dataset with another dataset of U.S. Census block shapes, aggregated up to county-level shapes using one of BigQuery's geospatial aggregation functions. This strategy was quickly replaced, however, after causing several instances to crash and significantly slowing performance. It is now created by through pandas' reading of an internet file located in a GitHub repository hosted by Plotly, the package with which nearly all visualizations in this report have been produced, using the urllib urlopen() and the pandas csv.read() functions.
display(cases_hist.show())
display(cases_chlor.show())
None
None
The Main Dependent Variable</br> The New Yorks Times' mask usage by county table contains 3,142 records, one for each county recognized by the U.S. Census Bureau. As will be discussed in the next section, four separate models were created for thise project as robustness checks; the sole differences between each model lie in the mask usage variables. In the primary model, all five mask usage categories are considerd. The three alternative models, however, aggregate these categories in different ways, as shown in the table below. Each of these aggregations was performed using a combination of SQL case statements and sum aggregation functions. The aggregations for each model are as follows:
Population Controls</br> The data gathered from the American Community Survey (ACS) provides the ability to control for population demographics that might otherwise have contributed to the spread of COVID-19 in a given county. Fortunately, this data is gathered and analyzed at the county level, eliminating the need to aggregate up from lower levels of the U.S. Census Bureau's georgraphical heierarchy. In joining this table with the rest of the data, no rows were lost, except some belonging to the ACS pertaining to "administrative subdivisions" for which no mask usage data was gathered. As discussed previously, demographic statistics, originally expressed as raw numbers in the ACS dataset, were converted into population proportions of each, by simply dividing the values by the county's total population. Once this transformation was performed, the raw numbers were then excluded from the dataset.
Government Response Controls</br> The data in this table is reported for several nations in addition to the United States. As such, the policy stringencies of the other nations were filtered out using the region_code, part of the table's composite key. The leading two characters in this value pertain to the country, whereas the remaining two correspond with regions within it. The SQL substring function was used to distinguish between the two and match them with the code pertaining to the U.S., as well as those pertaining to each state. Each state's levels of stringency were then determined for both the beginning and end of the observation period, and the lag function again used to get each on the same row. Finally, The SQL aggregation function max was used to determine the maximum level of stringency implemented for each policy during the period.
Population Density Controls</br> When joining the population densities dataset, it was discovered that leading zeros had been dropped from each applicable FIPS code, causing the join statement to return considerably less rows than required. To fix this issue, the FIPS codes were cast as strings and the SQL length function used to find codes with fewer than the required five digits. A case statement was used to append a leading zero to all codes determined to have fewer than five digits. This resolved the issue and caused the join to return the expected 3,141 records.
In additon to fixing the issue with the FIPS codes, states and, subsequently, the counties contained within were categorized into one of four U.S. Census Bureau-defined regions: Midwest, Northeast, South, or West. This was done by first creating a new column of state codes through the use of the SQL substring function. Since the first two digits of county FIPS codes correspond with that county's state code, these values were stripped and placed in a new column titled "state." Four case statements were then used to create one column for each region and match each record's state code with a list of state codes provided in each statement.
Finally, the metropolitan counties table provided data that allowed for the binary classification of each county as either metropolitan or non-metropolitan. Though the U.S. Census Buruea recognizes a total of 6 classifications of this sype, the two used in this analysis are the broadest categories of which all 6 are subsets. As with the population densities table mentioned above, leading zeros of county FIPS codes had been truncated and, as such, had to be added back using the method outlined previously. Once that issues had been resolved, a new column titled "metropolitan" was created and each county assigned either a 0 or 1, depending on their classification, using a case statement.
Exploring the relationship between mask use and population density was a large part of the exploratory analysis. The scatterplots below for the Primary Model and Alternative Model 3 appear to indicate that high levels of mask adherence might, in fact, be correlated with population density, as proposed in the studies cited earlier. This finding is also supported by the analyses performed later on. The reader is encouraged to alter the visualization by clicking and double-clicking on the lenged for each, altering the plot. In each visualization, larger values of mask usage variables corresponding with higher levels of adherence appear to have larger cricles, indicating more densely populated counties, clustered around the black line at $x=0$. Conversely, as those same values decrease, larger circles can be found at increasing values along the x-axis, indicating higher levels of COVID spread during the observation period.
display(fig0)
display(fig3)
The Model</br> As discussed, four models were constructed for this project, differing only in the way in which mask usage was categorized. In addition to the mask usage variables, the remainder of the model is comprised of control and interaction variables. In total, 26 control variables and one interaction term per mask usage variable were used. Each interaction term was defined as metropolitan*[mask usage variable], where metropolitan is an indicator variable designating each county as either a metropolitan or non-metropolitan area. The motivation behind these terms is the notion that mask usage is more beneficial in densely populated areas, as proposed by Aabed and Lashin (2021). When combined with the variables pertaining to mask usage, the following regression equations are achieved:
For the purposes of this paper, let $Y$ represent the $3141\times 1$ vector of of observed values $Y_i$, for $1 \leq i \leq 3141$. Furthermore, let $X$ represent the design matrix consisting of $1, X_{1,i}, X_{2,i}, \dots, X_{n,i},$ where $n \in (27,29,33)$ and $1 \leq i \leq 3141$. Let $\beta$ represent the $n\times 1$ vector of regression coefficients $\beta_1,\dots \beta_n$ where, again, $n \in (27,29,33)$. Finally, let $\varepsilon$ represent the $3141\times 1$ vector of disturbance terms, $\varepsilon_i$, for $1 \leq i \leq 3141$. Each model, then can simply be written as $Y=X\beta + \varepsilon$.
The Model</br> Statistical analysis and calculations were performed using R version 4.1.2. After preparing the data for analysis, a linear model was fit to the data and the residuals and predicted values saved for use in diagnostics. In order to visually assess the fit of the primary model, the observed values of Y were plotted against the predicted values of Y, the residuals were plotted against the predicted values of Y, and a normal QQ-plot of the residuals was created. From the analysis of these plots, it was concluded that curvature might be present in the model and that the error terms were both highly heteroskedastic and nonnormal. The plots generated for each of the alternative models yielded similar results.
In addition to the visual diagnostics performed above, formal testing was conducted using the Shapiro-Wilk and Brown-Forsythe tests. Both returned p-values less than the predetermined level of significance, 0.05, resulting in the rejection of both null hypotheses and the conclusion that the errors are both heteroskedastic and nonnormal. This violates the Gauss-Markov assumptions that the error terms have constant variance and be normally distributed. As such, a transformation of the dependent variable was performed using the Box-Cox method.
Box-Cox Transformation</br> Given that negative values were present in the vector $Y$, a translation was required before the power parameter, $\lambda$, could be determined. Given that Y has a minimum value of approximately -65.67 and a range of roughly 113, two different translations were considered (see Table 2). Let $\delta$ represent the value of the translation, so that $Y_t$=$Y+\delta$.
Given a power parameter $\lambda\in R$, the Box-Cox transformation is given by the equation below. $$ \widetilde{Y}= \left\{ \begin{array}{ll} {Y_t}^\lambda & if \;\lambda \neq 0 \\ ln(Y_t) & if \;\lambda=0 \\ \end{array} \right. $$
$ \widetilde{Y}$ was calculated for each model and translation, resulting in a total of eight transformed models of the form $\widetilde{Y} = X\beta+\varepsilon$. The Shapiro-Wilk and Brown-Forsythe tests were again run on each of these models in order to determine if the transformation eliminated the heteroskedasticity and nonnormality of the error terms. The results of the transformation are discussed in the next section.
Stepwise Regression</br> Due to the issues encountered during analysis, stepwise regression was performed as a final robustness check to determine if mask usage might have a statistically significant impact on the spread of COVID-19. The motivation behind this approach was that if a stepwise model was constructed using some, or all, of the mask usage variables, then this would serve as evidence of a potential effect of the latter on the former. Using R’s built-in stepwise regression function, this process was performed on each of the models under consideration, producing a total of four models with considerably reduced dimensionality. The results of this process are discussed in the next section.
</br>
The Box-Cox transformation, performed in an attempt to remedy the issues present in the errors, did not produce a useful result. Table 3 displays all eight transformations performed, as well as their respective Shapiro-Wilk and Brown-Forsythe test p-values. Though each transformation did increase the p-value returned from the Brown-Forsythe test in comparison to the untransformed model, it failed to do so to the point that the null hypothesis, homoskedasticity of the errors, could not be rejected. Similarly, not enough progress was made in regard to the p-value of the Shapiro-Wilk test that normality of the errors could not be rejected. Similar to the untransformed models, none of the mask usage variables in the transformed models were determined to be statistically significant.</br>
</br>
Finally, the results of the stepwise regression analysis can be seen in Table 3. For each model, the number of explanatory variables was reduced by roughly half, resulting in a more accurate model, as evidenced by their respective Akaike Information Criteria. These models, however, should not be interpreted as valid in regard to the theory behind the spread of COVID-19. Rather, these results should be interpreted as indicators that regression results might conclude a statistically significant impact of mask usage variables on Y once the issues with the error terms have been resolved. Though not necessarily in-line with the theory of disease spread, this approach did produce the best results, as evidenced through f-test comparisons of all models, as well as RMSE cross-validation using an 80:20 training/validation split, respectively. Specifically, the stepwise regression of the Primary model yielded the lowest RMSE of 1.4546. In contrast, both Alternative Models 1 and 2 produced the worst results in regard to RMSE. Despite this, neither the stepwise Primary Model nor the stepwise Alternative Model 1 should be considered good predictors of COVID spread. The map below gives a visual summary of the R-squared values for each state produced by the best model constructed thus far, the stepwise Primary Model. It can be seen that the model results in poor fit for the majority of states, though in some states it fares well. Cumulatively, the model has an R-squared value of approximately 0.10.
</br>
display(best_model_chlor)
Additionally, because daily infection rates for each county are available, running a time-series regression should be considered. This study focused simply on the difference in new cases witnessed at the beginning and ending of the observational period. It is entirely possible, and even intuitive, that case numbers at time $t$ are dependent on case numbers at time $t-1$. As such, constructing a time series model with the average new cases reported each week during the five-week observation period might reduce some of the noise and decrease the presence of heteroskedasticity. It should be noted, however, that mask usage would be assumed to be constant during each week, as was assumed in this analysis.
Finally, the potential biases in this model should not be ignored. As discussed earlier, the self-reported nature of Dynata’s mask usage survey introduces the potential for both self-selection and response bias. Furthermore, because mask compliance is likely dependent on the perceived risk of serious infection or death, endogeneity bias might be present in the model as it currently stands.